Search results: All records, Creators/Authors contains "Song, Linghao"

  1. The continued growth in the processing power of FPGAs, coupled with high bandwidth memory (HBM), makes systems like the Xilinx U280 credible platforms for linear solvers, which often dominate the run time of scientific and engineering applications. In this paper, we present Callipepla, an accelerator for a preconditioned conjugate gradient (CG) linear solver. FPGA acceleration of CG faces three challenges: (1) how to support an arbitrary problem and terminate acceleration processing on the fly, (2) how to coordinate long-vector data flow among processing modules, and (3) how to save off-chip memory bandwidth while maintaining double (FP64) precision accuracy. To tackle these three challenges, we present (1) a stream-centric instruction set for efficient streaming processing and control, (2) vector streaming reuse (VSR) and decentralized vector flow scheduling to coordinate vector data flow among modules and further reduce off-chip memory access latency with a double memory channel design, and (3) a mixed precision scheme that saves bandwidth yet still achieves solutions of effective double precision quality. To the best of our knowledge, this is the first work to introduce the concept of VSR for data reuse between on-chip modules, reducing unnecessary off-chip accesses and enabling modules to work in parallel in FPGA accelerators. We prototype the accelerator on a Xilinx U280 HBM FPGA. Our evaluation shows that compared to XcgSolver, the Xilinx HPC product, Callipepla achieves a 3.94× speedup, 3.36× higher throughput, and 2.94× better energy efficiency. Compared to an NVIDIA A100 GPU, which has 4× the memory bandwidth of Callipepla, we still achieve 77% of its throughput with 3.34× higher energy efficiency. The code is available at https://github.com/UCLA-VAST/Callipepla.
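As an illustration of the mixed-precision idea described above, the following NumPy sketch runs Jacobi-preconditioned CG with the sparse matrix-vector product cast to FP32 (standing in for the bandwidth-saving low-precision stream) while the scalar recurrences stay in FP64. It is a hypothetical software analogue only; the function name, tolerances, and the Jacobi preconditioner are assumptions, and it does not model Callipepla's stream instructions, VSR, or memory-channel scheduling.

```python
# Minimal sketch (assumed names and parameters): Jacobi-preconditioned CG with
# the SpMV executed in FP32, while vector updates and dot products stay FP64.
import numpy as np
import scipy.sparse as sp

def mixed_precision_pcg(A, b, tol=1e-10, max_iter=1000):
    A32 = A.astype(np.float32)           # low-precision copy streamed for SpMV
    M_inv = 1.0 / A.diagonal()           # Jacobi preconditioner, kept in FP64
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv * r
    p = z.copy()
    rz = float(r @ z)
    for k in range(max_iter):
        Ap = (A32 @ p.astype(np.float32)).astype(np.float64)   # FP32 SpMV
        alpha = rz / float(p @ Ap)
        x += alpha * p                   # FP64 vector updates
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = M_inv * r
        rz_new = float(r @ z)
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, k + 1

# Small SPD tridiagonal test system.
A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(64, 64), format="csr")
b = np.ones(64)
x, iters = mixed_precision_pcg(A, b)
print(iters, np.linalg.norm(b - A @ x))
```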
  2. Technological advances in long read sequencing have greatly facilitated the development of genomics. However, managing and analyzing raw genomic data that outpaces Moore's Law requires extremely high computational efficiency. On the one hand, existing software solutions can take hundreds of CPU hours to complete a human genome alignment. On the other hand, recently proposed hardware platforms achieve low processing throughput with significant overhead. In this paper, we propose PARC, a processing-in-memory architecture for long read pairwise alignment that leverages emerging resistive CAM (content-addressable memory) to accelerate chaining, the bottleneck step in DNA alignment. Chaining takes 2-tuple anchors as input and identifies a set of correlated anchors as potential alignment candidates. Unlike traditional main memory, which organizes relational data structures in a linear address space, PARC stores tuples in two neighboring crossbar arrays with a shared row decoder, so that column-wise in-memory computational operations and row-wise memory accesses can be performed in situ in a symmetric crossbar structure. Compared to both software tools and state-of-the-art accelerators, PARC shows significant improvement in alignment throughput and energy efficiency, thanks to its in-situ computation capability and optimized data mapping.
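For reference, the chaining step itself can be sketched as a simple dynamic program over (reference position, query position) anchor tuples, as below. This is a toy O(n²) software version with invented gap penalties and scores, intended only to show what chaining computes, not a model of PARC's CAM crossbar mapping.

```python
# Toy chaining dynamic program (assumed scoring): anchors are (ref_pos, query_pos)
# 2-tuples; a chain links anchors that advance in both coordinates, with a
# penalty for the difference between reference and query gaps.
def chain_anchors(anchors, max_gap=5000, anchor_score=15):
    anchors = sorted(anchors)                      # order by reference position
    n = len(anchors)
    score = [anchor_score] * n                     # best chain score ending at i
    prev = [-1] * n                                # predecessor for backtracking
    for i in range(n):
        ri, qi = anchors[i]
        for j in range(i):
            rj, qj = anchors[j]
            gap_r, gap_q = ri - rj, qi - qj
            if 0 < gap_r <= max_gap and 0 < gap_q <= max_gap:
                cand = score[j] + anchor_score - abs(gap_r - gap_q)
                if cand > score[i]:
                    score[i], prev[i] = cand, j
    best = max(range(n), key=score.__getitem__)    # end of the best chain
    chain = []
    while best != -1:
        chain.append(anchors[best])
        best = prev[best]
    return chain[::-1], max(score)

anchors = [(100, 10), (180, 85), (300, 210), (900, 220), (1000, 320)]
print(chain_anchors(anchors))
```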
  3. Deep learning is the core of artificial intelligence and achieves state-of-the-art results in a wide range of applications. The computation and data intensity of deep learning processing poses significant challenges to conventional computing platforms. Thus, specialized accelerator architectures have been proposed for the acceleration of deep learning. In this paper, we classify the design space of current deep learning accelerators into three levels: (1) processing engine, (2) memory, and (3) accelerator, and present a constructive view from the perspective of parallelism at these three levels.
  4. Deep neural network (DNN) accelerators, as an example of domain-specific architecture, have demonstrated great success in DNN inference. However, architectural acceleration for the equally important DNN training has not yet been fully studied. With forward data propagation, error backpropagation, and gradient calculation, DNN training is a more complicated process with higher computation and communication intensity. Because recent research demonstrates a diminishing specialization return, namely the "accelerator wall", we believe a promising approach is to explore coarse-grained parallelism among multiple performance-bounded accelerators to support DNN training. Distributing computations across multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remains challenging. We present ACCPAR, a principled and systematic method for determining the tensor partition among heterogeneous accelerator arrays. Compared to prior empirical or unsystematic methods, ACCPAR considers the complete tensor partition space and can reveal previously unknown parallelism configurations. ACCPAR optimizes performance based on a cost model that takes into account both the computation and communication costs of a heterogeneous execution environment, and thus avoids the drawbacks of existing approaches that use communication as a proxy for performance. The enhanced flexibility of tensor partitioning in ACCPAR allows flexible ratios of computation to be distributed among accelerators with different performance. The proposed search algorithm is also applicable to the emerging multi-path patterns in modern DNNs such as ResNet. We simulate ACCPAR on a heterogeneous accelerator array composed of both TPU-v2 and TPU-v3 accelerators for the training of large-scale DNN models such as AlexNet and the VGG and ResNet series. The average performance improvements of the state-of-the-art "one weird trick" (OWT), HYPAR, and ACCPAR, normalized to the baseline data parallelism scheme where each accelerator replicates the model and processes different input data in parallel, are 2.98×, 3.78×, and 6.30×, respectively.
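The flavor of cost-model-driven partitioning can be sketched with a toy example that splits one layer's batch dimension between two accelerators of different peak throughput and picks the ratio minimizing compute plus communication time. All constants, the single partition type, and the crude communication proxy below are invented for illustration; ACCPAR's actual cost model covers the full tensor partition space.

```python
# Hypothetical cost model (all constants invented): split a layer's batch
# between two accelerators with different peak throughput and choose the ratio
# minimizing max(compute time) plus a crude communication term.
def best_batch_split(batch, flops_per_sample, bytes_exchanged_per_sample,
                     peak_flops=(180e12, 420e12),   # e.g. a TPU-v2/TPU-v3 pair
                     link_bw=50e9, steps=100):
    best = None
    for s in range(steps + 1):
        n0 = round(batch * s / steps)               # samples on accelerator 0
        n1 = batch - n0
        t_compute = max(n0 * flops_per_sample / peak_flops[0],
                        n1 * flops_per_sample / peak_flops[1])
        # Crude proxy: data crossing the link scales with the smaller shard.
        t_comm = min(n0, n1) * bytes_exchanged_per_sample / link_bw
        total = t_compute + t_comm
        if best is None or total < best[0]:
            best = (total, n0, n1)
    return best   # (time in seconds, samples on device 0, samples on device 1)

# ~2 GFLOPs and ~4 MB exchanged per sample (invented numbers).
print(best_batch_split(batch=1024, flops_per_sample=2e9,
                       bytes_exchanged_per_sample=4e6))
```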
  5. Deep neural networks (DNNs) have emerged as a key component in various applications. However, the ever-growing DNN size hinders efficient processing on hardware. To tackle this problem, on the algorithmic side, compressed DNN models have been explored, of which block-circulant DNN models are memory efficient and hardware-friendly; on the hardware side, resistive random-access memory (ReRAM) based accelerators are promising for in-situ processing of DNNs. In this work, we design an accelerator named ReBoc for accelerating block-circulant DNNs in ReRAM, reaping the benefits of lightweight models and efficient in-situ processing simultaneously. We propose a novel mapping scheme which utilizes Horizontal Weight Slicing and Intra-Crossbar Weight Duplication to map block-circulant DNN models onto ReRAM crossbars with significantly improved crossbar utilization. Moreover, two techniques, Input Slice Reusing and Input Tile Sharing, are introduced to exploit the circulant calculation structure of block-circulant DNNs to reduce data access and buffer size. In ReBoc, a DNN model is executed within an intra-layer processing pipeline, achieving 96× and 8.86× power efficiency improvements over state-of-the-art FPGA and ASIC accelerators for block-circulant neural networks, respectively. Compared to ReRAM-based DNN accelerators, ReBoc achieves an average 4.1× speedup and 2.6× energy reduction.
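The block-circulant structure that ReBoc exploits can be illustrated in a few lines of NumPy: each b×b weight block is defined by a single length-b vector, so a block-level matrix-vector product reduces to a circular convolution (computed here via FFT purely to verify the structure). The function names are assumptions, and the mapping of these blocks onto ReRAM crossbars, slicing, and duplication are not modeled.

```python
# Block-circulant weight sketch: every b-by-b block is a circulant matrix
# defined by one length-b vector, so a block's matrix-vector product is a
# circular convolution (done with FFT here only to check the structure).
import numpy as np

def block_circulant_matvec(blocks, x, b):
    rows, cols = len(blocks), len(blocks[0])
    x_parts = x.reshape(cols, b)
    y = np.zeros(rows * b)
    for i in range(rows):
        acc = np.zeros(b)
        for j in range(cols):
            acc += np.fft.ifft(np.fft.fft(blocks[i][j]) *
                               np.fft.fft(x_parts[j])).real
        y[i * b:(i + 1) * b] = acc
    return y

# Verify against the explicit dense block-circulant matrix.
b, rows, cols = 4, 2, 3
rng = np.random.default_rng(0)
blocks = [[rng.standard_normal(b) for _ in range(cols)] for _ in range(rows)]
x = rng.standard_normal(cols * b)
dense = np.block([[np.stack([np.roll(w, k) for k in range(b)], axis=1)
                   for w in row] for row in blocks])
assert np.allclose(block_circulant_matvec(blocks, x, b), dense @ x)
```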
  6. As the model size of deep neural networks (DNNs) grows for better performance, the increased computational cost of training and testing makes it extremely difficult to deploy DNNs on end/edge devices with limited resources while also satisfying response time requirements. To address this challenge, model compression, which shrinks model size and thus reduces computation cost, is widely adopted in the deep learning community. However, the practical impacts of hardware design are often ignored in these algorithm-level solutions, such as the increase in random accesses to the memory hierarchy and the constraints of memory capacity. On the other side, limited understanding of the computational needs at the algorithm level may lead to unrealistic assumptions during hardware design. In this work, we discuss this mismatch and show how our approach addresses it through an interactive design practice across both software and hardware levels.
  7. The ending of Moore's Law makes domain-specific architectures the future of computing, the most representative development being the emergence of various deep learning accelerators. Among the proposed solutions, resistive random access memory (ReRAM) based processing-in-memory (PIM) architecture is anticipated as a promising candidate because ReRAM has the capability of both data storage and in-situ computation. However, we found that existing solutions are unable to efficiently support the computational needs of training unsupervised generative adversarial networks (GANs), due to the lack of the following two features: (1) computation efficiency: GANs utilize a new operator, called transposed convolution, which inserts massive numbers of zeros into its input before a convolution operation, resulting in significant resource under-utilization; (2) data traffic: the data-intensive training process of GANs often incurs structurally heavy data traffic as well as frequent massive data swaps. Our research follows the PIM strategy by leveraging the energy efficiency of ReRAM arrays for vector-matrix multiplication to enhance performance and energy efficiency. Specifically, we propose a novel computation deformation technique that can skip zero insertions in transposed convolution to improve computation efficiency. Moreover, we explore an efficient pipelined training procedure to reduce on-chip memory access. The implementation of the related circuits and architecture is also discussed. Finally, we present our perspective on future trends and opportunities for deep learning accelerators.
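The zero-insertion inefficiency in transposed convolution can be seen in a small 1-D example: computing the operator by upsampling with zeros and convolving wastes most multiply-accumulates on zero operands, whereas a direct scatter-accumulate form touches only real inputs. The sketch below, with invented helper names, is a generic rewrite for illustration, not the paper's ReRAM-specific computation deformation.

```python
# 1-D stride-2 transposed convolution two ways: (a) zero-insert the input and
# run a plain convolution, which multiplies mostly by zeros, and (b) scatter-
# accumulate directly, the zero-skipping form. Both give the same output.
import numpy as np

def transposed_conv1d_zero_insert(x, w, stride=2):
    up = np.zeros(stride * (len(x) - 1) + 1)
    up[::stride] = x                      # stride-1 zeros between every input
    up = np.pad(up, len(w) - 1)           # "full" convolution padding
    wf = w[::-1]                          # flipped kernel for true convolution
    return np.array([up[i:i + len(w)] @ wf
                     for i in range(len(up) - len(w) + 1)])

def transposed_conv1d_direct(x, w, stride=2):
    out = np.zeros(stride * (len(x) - 1) + len(w))
    for i, xi in enumerate(x):            # every multiply uses a real input
        out[i * stride:i * stride + len(w)] += xi * w
    return out

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 1.0, -1.0])
assert np.allclose(transposed_conv1d_zero_insert(x, w),
                   transposed_conv1d_direct(x, w))
```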